Performance of Gene Name Recognition Tools on Patents

Authors

Maryam Habibi

David Luis Wiegandt

Florian Schmedding

Ulf Leser

Abstract

The accurate identification of gene and protein names in patents is an essential step in many commercially relevant applications, such as patent retrieval, prior art search, and patent classification. Because patents differ from scientific articles in a number of properties, it is questionable whether tools developed for scientific articles will work equally well on patents. Answering this question is complicated by the fact that only a few annotated patent corpora exist, which makes training hard. In this paper, we report on a comparative evaluation of four existing gene/protein named entity recognition and normalization tools, all trained on scientific articles, regarding their performance on two patent corpora. We analyze the tools with respect to different evaluation metrics to highlight their respective strengths and limitations. Our results reveal that the performance of these tools on patents is generally lower than on scientific articles. Using one of the four tools as an example, we also show that training on annotated patents considerably improves performance on patent corpora. We conclude that more effort must be invested in producing adequate training data for working with patents.

Keywords: Patent Mining, Named Entity Recognition, Named Entity Normalization, Gene and Protein Entities, Performance Measurements.

Similar resources

Adapting ChER for the recognition of chemical mentions in patents

ChER (Chemical Entity Recogniser) is a pipeline of natural language processing tools optimised for the recognition of chemical names in scientific abstracts. It formed the basis of our submissions to the previous edition of the CHEMDNER track in BioCreative IV, and was one of the top-performing systems both for the chemical document indexing (CDI) and chemical entity mention recognition (CEM) s...

Mining Patents with tmChem, GNormPlus and an Ensemble of Open Systems

The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. The CHEMDNER task at BioCreative V focused on information extraction from patents. This manuscript describes our submissions to the CEMP (chemical named entity recognition) and GPRO (gene and related object identification) subtasks. Our CEMP submission is an ensemble of...

Neji: Recognition of Chemical and Gene Mentions in Patent Texts

The BioCreative V.5 challenge focused on the recognition of chemicals and gene mentions in medicinal chemistry patents. For participation in the chemical entity (CEMP) and gene and protein (GPRO) recognition tasks, we used the concept recognition framework Neji and applied a machine-learning strategy using an optimized feature set. Our best submissions achieved an F-score of 86.6% for the identi...

Identification of chemical and gene mentions in patent texts using feature-rich conditional random fields

This article describes the application of Neji, a text-processing and concept recognition framework, to the automatic recognition of chemicals and gene mentions in medicinal chemistry patents. We used conditional random fields models trained with an optimized set of features including linguistic, orthographic, morphological, dictionary matching and local context features, dictionary-matching, and...

Evaluation of chemical and gene/protein entity recognition systems at BioCreative V.5: the CEMP and GPRO patents tracks

This paper presents the results of the BioCreative V.5 offline tasks, which evaluate the performance of, and assess the progress made by, strategies for the automatic recognition of chemical name and gene mentions in the running text of medicinal chemistry patent abstracts. A total of 21 teams submitted results for at least one of these tasks. The CEMP (chemical entity mention i...

Publication date: 2016
